Petabyte Scale Data Mining: Dream or Reality?

Authors

  • Alexander S. Szalay
  • Jim Gray
  • Jan vandenBerg

Abstract

Science is becoming very data intensive. Today’s astronomy datasets with tens of millions of galaxies already present substantial challenges for data mining. In less than 10 years the catalogs are expected to grow to billions of objects, and image archives will reach Petabytes. Imagine having a 100GB database in 1996, when disk scanning speeds were 30MB/s, and database tools were immature. Such a task today is trivial, almost manageable with a laptop. We think that the issue of a PB database will be very similar in six years. In this paper we scale our current experiments in data archiving and analysis on the Sloan Digital Sky Survey data six years into the future. We analyze these projections and look at the requirements of performing data mining on such data sets. We conclude that the task scales rather well: we could do the job today, although it would be expensive. There do not seem to be any show-stoppers that would prevent us from storing and using a Petabyte dataset six years from today.
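
As a rough illustration of the scan-speed figure quoted in the abstract, the sketch below computes how long one sequential pass over a dataset takes at a fixed read rate. It is not taken from the paper: the function name `scan_seconds` is hypothetical, and the calculation assumes decimal units (1 GB = 10^9 bytes), a single reader, and a constant 30 MB/s rate with no parallelism or indexing.

```python
def scan_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time to read size_bytes once, sequentially, at a constant rate."""
    return size_bytes / rate_bytes_per_s


# Decimal units are an assumption; the abstract does not say which it uses.
MB = 10**6
GB = 10**9
PB = 10**15

RATE_1996 = 30 * MB  # 30 MB/s, the 1996 scan speed quoted in the abstract

t_100gb = scan_seconds(100 * GB, RATE_1996)
t_1pb = scan_seconds(1 * PB, RATE_1996)

print(f"100 GB at 30 MB/s: {t_100gb / 60:.1f} minutes")
print(f"1 PB at 30 MB/s:   {t_1pb / 86400:.0f} days")
```

Under these assumptions a single full scan of 100 GB takes a little under an hour, while the same single-stream rate applied to a Petabyte would take over a year, which is why a Petabyte archive only becomes practical with faster disks and many of them reading in parallel.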

Similar resources

Performance Optimization of a Distributed Transcoding System based on Hadoop for Multimedia Streaming Services

In recent times, Hadoop based on the MapReduce model has gained considerable attention because the features of the data preprocessing techniques are not time-consuming and are suitable for processing large-scale data. In particular, MapReduce is emerging as an important programming model for developing distributed data-processing applications such as web indexing, data mining, log file analysis, ...

Chapter 1: Pattern Recognition in Time Series

Massive amounts of time series data are generated daily, in areas as diverse as astronomy, industry, sciences, and aerospace, to name just a few. One obvious problem in handling time series databases concerns their typically massive size: gigabytes or even terabytes are common, with more and more databases reaching the petabyte scale. Most classic data mining algorithms do not perform or scale...

Satya Sridhar Dusi Venkata

COMPUTING IN SCIENCE & ENGINEERING Large-scale computational simulations of physical phenomena produce enormous data sets, often in the terabyte and petabyte range. Unfortunately, advances in data management and visualization techniques have not kept pace with the growing size and complexity of such data sets. One paradigm for effective large-scale visualization is browsing regions containing s...

Panel: One Platform for Mining Structured & Unstructured Data: Dream or Reality?

Although enterprises commonly utilize sophisticated data integration technology and business intelligence tools for analysis of structured data, analysis of unstructured data is a separate process and is often limited to capabilities supported by a search engine. Users have separate and vastly different interfaces for structured and unstructured data: Business Intelligence for s...

A Vision for PetaByte Data Management and Analysis Services for the Arecibo Telescope

We survey the initial steps of a project to build a data management and data mining system for astronomy data generated by the Arecibo Telescope. The total amount of data that our project will have to manage will approach one Petabyte over five years. We describe some of the scientific challenges from the astronomy side, and we discuss initial thoughts on how to address these challenges through...

Journal title:
  • CoRR

Volume: cs.DB/0208013

Publication date: 2002